These are notes from current progress on the LimoRhyde → fGSEA pipe. Note that development is very much ongoing, and everything is subject to change.
Original file location for this document: /code/R/notes
Since I haven’t formally written out the rationale behind this pipe, I’ll write it here. Previously, we have attempted several pipes. One is the circadian gene identification → pathway analysis using non-expression based tools link, another is differential analysis at every time point → pathway analysis using fGSEA link. The former pipe is well underway. The latter pipe has hit some serious snags - the outputs are interesting, but much work remains to be done before any conclusions can be drawn mostly because the outputs are not yet readable or interpretable in a meaningful way. This current notebook is a part of that analysis - I hope that, by using an analysis that combines differential rhythmicity and differential expression, we can better setup our pathway analysis.
LimoRhyde is reported to be a tool that can identify rhythmic, differentially rhythmic, and differentially expressed genes all in one package .
This notebook was me following the LimoRhyde vignette and applying it to our dataset. The tutorial can be viewed via browseVignettes('limorhyde').
Some of the major challenges included
I took some time to generate an eset from our data. I did this because I thought this was required by LimoRhyde, and also because it seems like an appropriate formal method to store data, especially for later when the dataset becomes published.
I used the following resources to guide me through this:
The script where this was run is eset_conversion.R.
I modelled much of the eset structure off the LimoRhyde vignette eset. To view this data set, load the package and run this command.
eset = getGEO(filename = system.file('extdata', 'GSE34018_series_matrix.txt', package = 'limorhyde'))
View(eset)
The location for where the eset data is stored is /data/expressionset/
/data/expressionset/
│
├─eset.Rds eset R dataset
├─metadata.xlsx eset metadata in xlsx
├─metadata.csv generated from metadata.xlsx
├─phenoData.xlsx eset phenoData in xlsx
├─phenoData.csv generated from phenoData.xlsx
Libraries imported.
library(Biobase)
library(magrittr)
library(dplyr)
library(DT)
dataDirectory = /code/R/notes → /data/expressionset/
The feature counts data can be read in as is to eset@assayData.
This is enough to build a ‘minimal’ eset.
dataDirectory <- "../../../data/expressionset/"
exprs <- read.csv2("../../../KF_RNASeq_counts_filtered.txt", sep = ",", header = T, row.names = 1) %>%
set_rownames(.$Geneid) %>%
dplyr::select(-Geneid) %>%
as.matrix()
minimalSet <- ExpressionSet(assayData = exprs)
phenoData and metadata are read in, and an AnnotatedDataFrame object is built using both.
pData <- read.table(paste0(dataDirectory, "phenoData.csv"), row.names = 1, header = T, sep = ",")
metadata <- read.table(paste0(dataDirectory, "metadata.csv"), row.names = 1, header = T, sep = ",")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## EOF within quoted string
phenoData <- new("AnnotatedDataFrame",
data = pData, varMetadata = metadata)
Let’s take a look at the object structure.
phenoData
## An object of class 'AnnotatedDataFrame'
## rowNames: KF_01 KF_02 ... KF_72 (72 total)
## varLabels: sample_number title ... ZT_time (38 total)
## varMetadata: labelDescription
We can view the phenoData.
phenoData@data %>%
DT::datatable()
The metadata has the same number of rows as phenoData’s columns. It contains a description of each phenoData column name so people working with the data know what you’re referring to by, e.g., “KO” in a column.
phenoData@varMetadata %>%
DT::datatable()
experimentData is simply information about the experiment itself.
experimentData <- new("MIAME",
name=c("Sumeed Yoyo Manzoor","Katya Frazier"),
lab="Chang Lab",
contact="smanzoor@uchicago.edu",
title="WT vs L-Bmal1-KO in SPF vs GF conditions RNA Seq data",
abstract="FILLER TEXT",
url="https://changlab.uchicago.edu ",
other=list(
notes="Created from excel sheets and text files"
))
Finally, the formal eset is built using the above items.
eset <- ExpressionSet(assayData=exprs,
phenoData=phenoData,
experimentData=experimentData,
annotation="NA")
It can be saved in the data directory.
saveRDS(eset, paste0(dataDirectory, "eset.Rds"))
And loaded in future sessions.
test <- readRDS(paste0(dataDirectory, "eset.Rds"))
A brief look at the eset data.
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 55573 features, 72 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: KF_01 KF_02 ... KF_72 (72 total)
## varLabels: sample_number title ... ZT_time (38 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: NA
The eset is missing some data that we will need to fill in later:
eset@phenoData@data the following columns
extract_protocol RNA extraction protocol usedlabel_protocol RNA labelling protocolhyb_protocol Hybridization protocol, probably not relevant to RNA Seqscan_protocol Also probably not relevant to RNA Seqdata_processing Data processing to get featurecountsplatform_id Illumina platformeset@protocolDataeset@featureDataeset@annotation Illumina platformLibraries are imported.
library('annotate')
library('data.table')
library('foreach')
library('GEOquery')
library('ggplot2')
library('ggpubr')
library('knitr')
library('limma')
library('limorhyde')
library('org.Mm.eg.db')
Some functions were sourced from LimoRhyde.
source(system.file('extdata', 'vignette_functions.R', package = 'limorhyde'))
However, I didn’t end up using any of them.
Some parameters set. These values haven’t yet been reviewed, just set following the vignette’s instructions.
period = 24
qvalRhyCutoff = 0.15
qvalDrCutoff = 0.1
dataDirectory <- "../../../data/expressionset/"
eset = readRDS(paste0(dataDirectory, "eset.Rds"))
phenoData is read into sm, which is then selected only for sampple, condition, and Zeitgeber time.
sm = data.table(pData(phenoData(eset)))
sm = sm[, .(title, sample = sample_number, cond = genotype_microbe, time = ZT_time)]
setorderv(sm, c('cond', 'time'))
LimoRhyde then runs Fourier analysis of time to produce time_cos and time_sin.
sm = cbind(sm, limorhyde(sm$time, 'time_'))
sm %>%
DT::datatable()
The vignette at this point obtains log-transformed expression values from eset@featureData. Instead, I just use featurecounts data eset@assayData$exprs.
emat = eset@assayData$exprs
Using limma , rhythmic genes are obtained.
rhyLimma = foreach(condNow = unique(sm$cond), .combine = rbind) %do% {
design = model.matrix(~ time_cos + time_sin, data = sm[cond == condNow])
fit = lmFit(emat[, sm$cond == condNow], design)
fit = eBayes(fit, trend = TRUE)
rhyNow = data.table(topTable(fit, coef = 2:3, number = Inf), keep.rownames = TRUE)
setnames(rhyNow, 'rn', 'geneId')
rhyNow[, cond := condNow]
}
rhyLimmaSummary = rhyLimma[, .(P.Value = min(P.Value)), by = geneId]
rhyLimmaSummary[, adj.P.Val := p.adjust(P.Value, method = 'BH')]
setorderv(rhyLimmaSummary, 'adj.P.Val')
rhyLimmaSummary %>%
DT::datatable()
All genes are used in statistical tests, but only results for rhythmic genes are kept.
design = model.matrix(~ cond * (time_cos + time_sin), data = sm)
fit = lmFit(emat, design)
fit = eBayes(fit, trend = TRUE)
drLimma = data.table(topTable(fit, coef = 5:6, number = Inf), keep.rownames = TRUE)
setnames(drLimma, 'rn', 'geneId')
drLimma = drLimma[geneId %in% rhyLimmaSummary[adj.P.Val <= qvalRhyCutoff]$geneId]
drLimma[, adj.P.Val := p.adjust(P.Value, method = 'BH')]
setorderv(drLimma, 'adj.P.Val')
drLimma %>%
DT::datatable()
This is a linear model of cond alone.
design = model.matrix(~ cond + time_cos + time_sin, data = sm)
fit = lmFit(emat, design)
fit = eBayes(fit, trend = TRUE)
deLimma = data.table(topTable(fit, coef = 2, number = Inf), keep.rownames = TRUE)
setnames(deLimma, 'rn', 'geneId')
deLimma = deLimma[!(geneId %in% drLimma[adj.P.Val <= qvalDrCutoff]$geneId)]
deLimma[, adj.P.Val := p.adjust(P.Value, method = 'BH')]
setorderv(deLimma, 'adj.P.Val')
# deLimma$geneId = rownames(deLimma)
# rownames(deLimma) = NULL
# deLimma = anti_join(deLimma, filter(drLimma, adj.P.Val <= qvalDrCutoff), by = 'geneId')
# deLimma$adj.P.Val = p.adjust(deLimma$P.Value, method = 'BH')
deLimma %>%
DT::datatable()
A volcano plot of \(\log_{2}\,\text{fold change}\) differentially expressed genes x \(\log_{10}\,q_{DE}\) (adj p val) shows many significant results with both positive and negative expression fold change.
ggplot(deLimma) +
geom_point(aes(x = logFC, y = -log10(adj.P.Val)), size = 0.2, alpha = 0.5) +
labs(x = expression(log[2]*' fold-change'), y = expression(-log[10]*' '*q[DE])) +
geom_hline(yintercept = -log10(0.05)) +
geom_text(aes(-700000, -log10(0.05), label = "adj p val = 0.05", vjust = -1, fontface = 3)) +
theme_pubr()
Zooming in
trim_val <- 1000
deLimma %>%
subset(.$logFC < trim_val) %>%
subset(.$logFC > -trim_val) %>%
ggplot() +
geom_point(aes(x = logFC, y = -log10(adj.P.Val)), size = 0.2, alpha = 0.5) +
labs(x = expression(log[2]*' fold-change'), y = expression(-log[10]*' '*q[DE])) +
geom_hline(yintercept = -log10(0.05)) +
geom_text(aes(-1000, -log10(0.05), label = "adj p val = 0.05", vjust = -1, fontface = 3)) +
theme_pubr()
The top rhythmic gene, top differentially rhythmic gene, and top differentially expressed gene is pulled.
I get gene names from ENSEMBL via an AnnotationDbi query see page 4.
geneIdsNow = c(rhyLimmaSummary$geneId[1], drLimma$geneId[1], deLimma$geneId[1])
geneSymbolsNow = AnnotationDbi::select(org.Mm.eg.db, keys=geneIdsNow, keytype = 'ENSEMBL', columns = c('SYMBOL')) %>% dplyr::select(SYMBOL) %>% unlist() %>% unname()
Some information on these genes.
AnnotationDbi::select(org.Mm.eg.db, keys=geneIdsNow, keytype = 'ENSEMBL', columns = c('SYMBOL','GENENAME')) %>%
mutate(desc = c('top rhythmic gene', 'top differentially rhythmic gene', 'top differentially expressed gene')) %>%
DT::datatable()
A plot of expression by time in each condition.
df = data.table(t(emat[geneIdsNow, ]))
setnames(df, geneSymbolsNow)
df[, sample := colnames(emat[geneIdsNow, ])]
df = merge(df, sm[, .(sample, cond, time)], by = 'sample')
df = melt(df, measure.vars = geneSymbolsNow, variable.name = 'geneSymbol',
value.name = 'expr')
levels(df$geneSymbol) = geneSymbolsNow
ggplot(df) +
facet_grid(geneSymbol ~ cond, scales = 'free_y') +
geom_point(aes(x = time, y = expr, shape = cond, color = geneSymbol), size = 2) +
labs(x = 'Zeitgeber time (h)', y = 'Expression (norm.)') +
scale_shape(solid = FALSE, guide = FALSE) +
scale_color_brewer(type = 'qual', palette = 'Dark2', guide = FALSE) +
scale_x_continuous(limits = c(0, 24), breaks = seq(0, 24, 4))
Thus, these
LimoRhyde works, which is great. I can throw the data into it, and the package data works as expected. There are some issues that need to be addressed asap, however.
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] org.Mm.eg.db_3.8.2 limorhyde_0.1.1 limma_3.40.6
## [4] knitr_1.28 ggpubr_0.2.5 ggplot2_3.3.0
## [7] GEOquery_2.52.0 foreach_1.4.8 data.table_1.12.8
## [10] annotate_1.62.0 XML_3.99-0.3 AnnotationDbi_1.46.1
## [13] IRanges_2.18.3 S4Vectors_0.22.1 DT_0.13
## [16] dplyr_0.8.5 magrittr_1.5 Biobase_2.44.0
## [19] BiocGenerics_0.30.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 tidyr_1.0.2 assertthat_0.2.1 digest_0.6.25
## [5] R6_2.4.1 RSQLite_2.2.0 evaluate_0.14 pillar_1.4.3
## [9] rlang_0.4.5 blob_1.2.1 rmarkdown_2.1 labeling_0.3
## [13] splines_3.6.3 readr_1.3.1 stringr_1.4.0 htmlwidgets_1.5.1
## [17] RCurl_1.98-1.1 bit_1.1-15.2 munsell_0.5.0 compiler_3.6.3
## [21] xfun_0.12 pkgconfig_2.0.3 htmltools_0.4.0 tidyselect_1.0.0
## [25] tibble_2.1.3 codetools_0.2-16 crayon_1.3.4 withr_2.1.2
## [29] bitops_1.0-6 grid_3.6.3 jsonlite_1.6.1 xtable_1.8-4
## [33] gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0 scales_1.1.0
## [37] stringi_1.4.6 farver_2.0.3 ggsignif_0.6.0 xml2_1.2.5
## [41] vctrs_0.2.4 RColorBrewer_1.1-2 iterators_1.0.12 tools_3.6.3
## [45] bit64_0.9-7 glue_1.3.2 purrr_0.3.3 hms_0.5.3
## [49] crosstalk_1.1.0.1 yaml_2.2.1 colorspace_1.4-1 memoise_1.1.0
smanzoor@uchicago.edu